Site Reliability Engineering (SRE)

Technology Estepona, Spain

Description

Position at Ant Group

Key Responsibilities

• Ensure Payment System Stability and High Availability: Lead technical initiatives to strengthen the reliability of our payment systems. This includes designing and implementing monitoring tools, logging frameworks, dashboards, diagnostic utilities, and disaster recovery plans.

• Incident Handling and Emergency Response: Conduct routine drills, develop contingency strategies, and participate in on-call rotations to ensure rapid response and resolution of production issues across regions.

• Analyze and Optimize Production Issues: Investigate and analyze real-world production cases, such as performance bottlenecks or system inefficiencies, to derive actionable insights and establish technical best practices. Contribute to the evolution of a highly available and resilient payment architecture.

• Design and Implement Infrastructure Solutions: Architect and set up new Internet Data Centers (IDCs) to meet scalability and performance requirements. Develop and execute comprehensive data protection plans that adhere to industry standards and compliance requirements, ensuring data integrity and security.

Technical Requirements

• Solid knowledge of computer science fundamentals, including the principles of operating systems (Unix/Linux), storage, and networking.

• Proficient in at least one programming language, such as Java, Python, or Shell, with experience developing operations and maintenance tools.

• Strong ability to resolve system problems, good communication skills, and a sense of ownership.

• Experience operating Google Cloud Platform (GCP) or Oracle Cloud Infrastructure (OCI), OLAP platforms (such as DPDI, Flink, AntSpark), OceanBase (OB), or Ant Trust-Native Service (ATS) is a plus.
